This is a loan dataset from Prosper. I chose 15 variables to analyze: EstimatedReturn, ProsperScore, Occupation, IsBorrowerHomeowner, TotalCreditLinespast7years, TotalInquiries, DelinquenciesLast7Years, TotalTrades, DebtToIncomeRatio, IncomeRange, TotalProsperPaymentsBilled, ProsperPrincipalBorrowed, ProsperPrincipalOutstanding, LoanOriginalAmount, and MonthlyLoanPayment. Many columns contain NA values. Rather than dropping those rows from the dataset, I remove NAs only within each variable's own analysis, so that as many observations as possible are used.
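The effect of removing NAs per variable instead of per row can be sketched on a small example. The data frame `toy` below is made up for illustration and only borrows three column names from the dataset; it is not the author's code.

```r
# Toy data frame (made-up values) standing in for three columns of the loan data
toy <- data.frame(
  EstimatedReturn    = c(NA, 0.05, NA, 0.06, 0.09, 0.11),
  TotalTrades        = c(11, NA, 15, 26, 39, 47),
  LoanOriginalAmount = c(9425, 10000, 3001, 10000, 15000, 15000)
)

# Listwise deletion: keep only rows with no NA in any column
n_complete <- sum(complete.cases(toy))

# Per-variable deletion: count the usable observations column by column
n_per_column <- colSums(!is.na(toy))

n_complete     # 3 rows survive listwise deletion
n_per_column   # 4, 5 and 6 usable values, one count per column
```

Every column keeps more observations under per-variable removal than under listwise deletion, which is the motivation stated above.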
library(ggplot2)
library(dplyr)
library(gridExtra)
library(grid)
library(SparseM)
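# load dataset; stringsAsFactors = TRUE makes R >= 4.0 read text columns such as
# Occupation and IncomeRange as factors, as in the str() output below
pld <- read.csv("D:/Udacity/prosperLoanData.csv", stringsAsFactors = TRUE)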
dim(pld)
## [1] 113937 81
# a dataset including the 15 selected variables
pld.ana <- pld[, c("EstimatedReturn", "ProsperScore", "Occupation", "IsBorrowerHomeowner",
"TotalCreditLinespast7years", "TotalInquiries", "DelinquenciesLast7Years", "TotalTrades", "DebtToIncomeRatio", "IncomeRange", "TotalProsperPaymentsBilled", "ProsperPrincipalBorrowed", "ProsperPrincipalOutstanding", "LoanOriginalAmount", "MonthlyLoanPayment")]
str(pld.ana)
## 'data.frame': 113937 obs. of 15 variables:
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding: num NA NA NA NA 9948 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
dim(pld.ana)
## [1] 113937 15
# 6 occupations selected
occupation_sel <- factor(c("Accountant/CPA", "Administrative Assistant", "Computer Programmer", "Executive", "Sales - Commission", "Teacher"))
pld.Occupation <- pld.ana[pld.ana$Occupation %in% occupation_sel, ]
# drop unused levels, leaving the 6 selected occupations
pld.Occupation$Occupation <- factor(pld.Occupation$Occupation)
# variable "ProsperScore" to factor
pld.ana$ProsperScore <- factor(pld.ana$ProsperScore)
The original data contains 81 variables and 113937 observations, and the analysed data contains 15 variables and 113937 observations. In the original data, “TotalProsperLoans” ranges from 0 to 8. The “ProsperScore” variable, converted to a factor above, has levels up to 11.
# data summary
summary(pld.ana)
## EstimatedReturn ProsperScore Occupation
## Min. :-0.183 4 :12595 Other :28617
## 1st Qu.: 0.074 6 :12278 Professional :13628
## Median : 0.092 8 :12053 Computer Programmer : 4478
## Mean : 0.096 7 :10597 Executive : 4311
## 3rd Qu.: 0.117 5 : 9813 Teacher : 3759
## Max. : 0.284 (Other):27517 Administrative Assistant: 3688
## NA's :29084 NA's :29084 (Other) :55456
## IsBorrowerHomeowner TotalCreditLinespast7years TotalInquiries
## False:56459 Min. : 2.00 Min. : 0.000
## True :57478 1st Qu.: 17.00 1st Qu.: 2.000
## Median : 25.00 Median : 4.000
## Mean : 26.75 Mean : 5.584
## 3rd Qu.: 35.00 3rd Qu.: 7.000
## Max. :136.00 Max. :379.000
## NA's :697 NA's :1159
## DelinquenciesLast7Years TotalTrades DebtToIncomeRatio
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 15.00 1st Qu.: 0.140
## Median : 0.000 Median : 22.00 Median : 0.220
## Mean : 4.155 Mean : 23.23 Mean : 0.276
## 3rd Qu.: 3.000 3rd Qu.: 30.00 3rd Qu.: 0.320
## Max. :99.000 Max. :126.00 Max. :10.010
## NA's :990 NA's :7544 NA's :8554
## IncomeRange TotalProsperPaymentsBilled
## $25,000-49,999:32192 Min. : 0.00
## $50,000-74,999:31050 1st Qu.: 9.00
## $100,000+ :17337 Median : 16.00
## $75,000-99,999:16916 Mean : 22.93
## Not displayed : 7741 3rd Qu.: 33.00
## $1-24,999 : 7274 Max. :141.00
## (Other) : 1427 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding LoanOriginalAmount
## Min. : 0 Min. : 0 Min. : 1000
## 1st Qu.: 3500 1st Qu.: 0 1st Qu.: 4000
## Median : 6000 Median : 1627 Median : 6500
## Mean : 8472 Mean : 2930 Mean : 8337
## 3rd Qu.:11000 3rd Qu.: 4127 3rd Qu.:12000
## Max. :72499 Max. :23451 Max. :35000
## NA's :91852 NA's :91852
## MonthlyLoanPayment
## Min. : 0.0
## 1st Qu.: 131.6
## Median : 217.7
## Mean : 272.5
## 3rd Qu.: 371.6
## Max. :2251.5
##
# histograms
range(pld.ana$EstimatedReturn, na.rm = TRUE)
## [1] -0.1827 0.2837
ggplot(aes(EstimatedReturn), data = subset(pld.ana, !is.na(EstimatedReturn))) +
geom_histogram()
EstimatedReturn ranges from -0.1827 to 0.2837. The histogram shows that most returns fall between 0 and 0.2.
# adjusted histogram
plt_EstimatedReturn <- ggplot(aes(EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn))) + geom_histogram(binwidth = 0.001) +
xlim(0, 0.2)
plt_EstimatedReturn
The most frequent return is around 0.125, and the distribution appears to be multimodal.
summary(pld.ana$EstimatedReturn)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.074 0.092 0.096 0.117 0.284 29084
# histogram of TotalCreditLinespast7years
range(pld.ana$TotalCreditLinespast7years, na.rm = TRUE)
## [1] 2 136
plt_TotalCreditLinespast7years <- ggplot(aes(TotalCreditLinespast7years), data = subset(pld.ana, !is.na(TotalCreditLinespast7years))) +
geom_histogram(binwidth = .2) +
scale_x_continuous(limits = c(0, 120), breaks = seq(0, 120, 20))
plt_TotalCreditLinespast7years
The mode of TotalCreditLinespast7years is slightly above 20, and the distribution is roughly bell-shaped with a longer right tail (mean 26.75, median 25).
summary(pld.ana$TotalCreditLinespast7years)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 17.00 25.00 26.75 35.00 136.00 697
# histogram of TotalInquiries
range(pld.ana$TotalInquiries, na.rm = TRUE)
## [1] 0 379
plt_TotalInquiries <- ggplot(aes(TotalInquiries), data = subset(pld.ana, !is.na(TotalInquiries))) +
geom_histogram(binwidth = 0.5) +
scale_x_continuous(limits = c(0, 60), breaks = seq(0, 60, 10))
plt_TotalInquiries
The TotalInquiries variable is clearly positively skewed.
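Positive skew can also be checked numerically rather than read off the plot. The sketch below uses a hypothetical helper, `sample_skewness`, and a made-up vector of inquiry counts; neither comes from the analysis itself.

```r
# Hypothetical helper: moment-based sample skewness
# (third central moment divided by the second central moment to the power 1.5)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up right-skewed counts: mostly small values plus one large one
inquiries <- c(0, 1, 1, 2, 2, 3, 4, 5, 7, 40, NA)

skew <- sample_skewness(inquiries)
skew > 0                                                          # TRUE for a right-skewed sample
mean(inquiries, na.rm = TRUE) > median(inquiries, na.rm = TRUE)   # TRUE: mean 6.5, median 2.5
```

A mean well above the median, as in the summary of TotalInquiries (5.584 against 4), points the same way.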
summary(pld.ana$TotalInquiries)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.000 4.000 5.584 7.000 379.000 1159
# histogram of DelinquenciesLast7Years
range(pld.ana$DelinquenciesLast7Years, na.rm = TRUE)
## [1] 0 99
plt_DelinquenciesLast7Years <- ggplot(aes(DelinquenciesLast7Years), data = subset(pld.ana, !is.na(DelinquenciesLast7Years))) +
geom_histogram(binwidth = 0.5)
plt_DelinquenciesLast7Years
The majority of the values are 0, which means most borrowers had no delinquencies in the past 7 years.
summary(pld.ana$DelinquenciesLast7Years)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
# histogram of TotalTrades
range(pld.ana$TotalTrades, na.rm = TRUE)
## [1] 0 126
plt_TotalTrades <- ggplot(aes(TotalTrades), data = subset(pld.ana, !is.na(TotalTrades))) +
geom_histogram(binwidth = 0.5)
plt_TotalTrades
The distribution of TotalTrades is similar to that of TotalCreditLinespast7years, which suggests the two variables are closely related.
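One way to test that impression is a correlation computed on the rows where both values are present. The sketch below uses only the first ten rows printed in the `str()` output, so it illustrates the call and is far too small a sample to support a conclusion.

```r
# First ten values of each column as printed by str(pld.ana)
credit_lines <- c(12, 29, 3, 29, 49, 49, 20, 10, 32, 32)
trades       <- c(11, 29, NA, 26, 39, 47, 16, 10, 29, 29)

# Pearson correlation, skipping the row where TotalTrades is NA
r <- cor(credit_lines, trades, use = "complete.obs")
r
```

On the full data the same call would be `cor(pld.ana$TotalCreditLinespast7years, pld.ana$TotalTrades, use = "complete.obs")`.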
summary(pld.ana$TotalTrades)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 15.00 22.00 23.23 30.00 126.00 7544
# histogram of DebtToIncomeRatio
range(pld.ana$DebtToIncomeRatio, na.rm = TRUE)
## [1] 0.00 10.01
plt_DebtToIncomeRatio <- ggplot(aes(DebtToIncomeRatio), data = subset(pld.ana, !is.na(DebtToIncomeRatio))) +
geom_histogram(binwidth = 0.01)+
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2))
plt_DebtToIncomeRatio
summary(pld.ana$DebtToIncomeRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The median here is 0.22, slightly below the mean of 0.276, which indicates a right-skewed distribution.
# histogram of TotalProsperPaymentsBilled
range(pld.ana$TotalProsperPaymentsBilled, na.rm = TRUE)
## [1] 0 141
plt_TotalProsperPaymentsBilled <- ggplot(aes(TotalProsperPaymentsBilled), data = subset(pld.ana, !is.na(TotalProsperPaymentsBilled))) +
geom_histogram(binwidth = 0.5) +
scale_x_continuous(breaks = seq(0, 141, 25))
plt_TotalProsperPaymentsBilled
The plot shows that the number of on-time payments the borrower made on Prosper loans is most often 6, 9, or 35.
summary(pld.ana$TotalProsperPaymentsBilled)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.00 16.00 22.93 33.00 141.00 91852
# histogram of ProsperPrincipalBorrowed
range(pld.ana$ProsperPrincipalBorrowed, na.rm = TRUE)
## [1] 0 72499
plt_ProsperPrincipalBorrowed <- ggplot(aes(log(ProsperPrincipalBorrowed+1)), data = subset(pld.ana, !is.na(ProsperPrincipalBorrowed))) +
geom_histogram(binwidth = 0.1) +
xlim(6,11)
plt_ProsperPrincipalBorrowed
The mode of the distribution is around 3000 (about 8 on the log scale of the plot).
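Because the histogram is drawn on `log(x + 1)`, a peak read off the axis has to be transformed back to get an amount. A minimal sketch using base R's `log1p()` and `expm1()`, which compute `log(1 + x)` and `exp(x) - 1`:

```r
amount <- 3000

# Forward transform, identical to log(amount + 1)
log_amount <- log1p(amount)
log_amount            # about 8.007

# Back-transform recovers the original amount
expm1(log_amount)     # 3000

# A peak read off the log axis at 8 corresponds to roughly
expm1(8)              # about 2980
```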
summary(pld.ana$ProsperPrincipalBorrowed)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3500 6000 8472 11000 72500 91852
# histogram of ProsperPrincipalOutstanding
range(pld.ana$ProsperPrincipalOutstanding, na.rm = TRUE)
## [1] 0.00 23450.95
plt_ProsperPrincipalOutstanding <- ggplot(aes(log(ProsperPrincipalOutstanding+1)), data = subset(pld.ana, !is.na(ProsperPrincipalOutstanding))) +
geom_histogram()
plt_ProsperPrincipalOutstanding
The most common outstanding principal is 0: the first quartile is 0, so at least a quarter of the non-missing values are 0.
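The exact share of zeros can be computed directly. The vector below is made up for illustration; on the real data the same expression would be applied to `pld.ana$ProsperPrincipalOutstanding`.

```r
# Made-up outstanding balances, including missing values
outstanding <- c(NA, NA, 0, 0, 0, 1627, 4127, 9948, NA, 0)

# Proportion of non-missing values that are exactly zero
share_zero <- mean(outstanding == 0, na.rm = TRUE)
share_zero   # 4 of the 7 non-missing values
```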
summary(pld.ana$ProsperPrincipalOutstanding)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1627 2930 4127 23450 91852
# histogram of LoanOriginalAmount
range(pld.ana$LoanOriginalAmount, na.rm = TRUE)
## [1] 1000 35000
plt_LoanOriginalAmount <- ggplot(aes(log(LoanOriginalAmount+1)), data = subset(pld.ana, !is.na(LoanOriginalAmount))) +
geom_histogram()
plt_LoanOriginalAmount
The values are distributed very unevenly, with large counts at 4000, 10000 and 15000.
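Spikes at particular amounts can be confirmed by tabulating the values and sorting the counts. This is a sketch on a made-up vector, not the author's code:

```r
# Made-up loan amounts with repeats at round numbers
loan_amount <- c(4000, 10000, 15000, 4000, 3001, 10000,
                 4000, 15000, 10000, 4000, 9425)

# Three most frequent amounts and their counts
top_values <- head(sort(table(loan_amount), decreasing = TRUE), 3)
top_values
```

Applied to `pld.ana$LoanOriginalAmount`, the same call would list the most common loan sizes without depending on the histogram's bin width.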
summary(pld.ana$LoanOriginalAmount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
# histogram of MonthlyLoanPayment
range(pld.ana$MonthlyLoanPayment, na.rm = TRUE)
## [1] 0.00 2251.51
plt_MonthlyLoanPayment <- ggplot(aes(log(MonthlyLoanPayment+1)), data = subset(pld.ana, !is.na(MonthlyLoanPayment))) +
geom_histogram(binwidth = .1)
plt_MonthlyLoanPayment
Basically, every value has a frequency under 3500, except the value around 140, whose frequency exceeds 8000.
summary(pld.ana$MonthlyLoanPayment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2252.0
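When medians are needed for several numeric columns at once, `sapply()` can return them as one named vector. The sketch below uses a toy data frame built from the first five rows printed in the `str()` output; the name `medians` is an assumption for illustration.

```r
# First five rows of three columns, as printed by str(pld.ana)
toy <- data.frame(
  EstimatedReturn    = c(NA, 0.0547, NA, 0.06, 0.0907),
  TotalInquiries     = c(3, 5, 1, 1, 9),
  LoanOriginalAmount = c(9425, 10000, 3001, 10000, 15000)
)

# One median per column, ignoring NAs
medians <- sapply(toy, median, na.rm = TRUE)
medians

# Individual values are then looked up by column name
medians[["TotalInquiries"]]   # 3
```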
# histograms with median line
EstimatedReturnMedian <- median(pld.ana$EstimatedReturn, na.rm = TRUE)
TotalCreditLinespast7yearsMedian <- median(pld.ana$TotalCreditLinespast7years, na.rm = TRUE)
TotalInquiriesMedian <- median(pld.ana$TotalInquiries, na.rm = TRUE)
DelinquenciesLast7YearsMedian <- median(pld.ana$DelinquenciesLast7Years, na.rm = TRUE)
TotalTradesMedian <- median(pld.ana$TotalTrades, na.rm = TRUE)
DebtToIncomeRatioMedian <- median(pld.ana$DebtToIncomeRatio, na.rm = TRUE)
TotalProsperPaymentsBilledMedian <- median(pld.ana$TotalProsperPaymentsBilled, na.rm = TRUE)
ProsperPrincipalBorrowedMedian <- median(pld.ana$ProsperPrincipalBorrowed, na.rm = TRUE)
ProsperPrincipalOutstandingMedian <- median(pld.ana$ProsperPrincipalOutstanding, na.rm = TRUE)
LoanOriginalAmountMedian <- median(pld.ana$LoanOriginalAmount, na.rm = TRUE)
MonthlyLoanPaymentMedian <- median(pld.ana$MonthlyLoanPayment, na.rm = TRUE)
plt_EstimatedReturn_MedianLine <- plt_EstimatedReturn + geom_vline(aes(xintercept = EstimatedReturnMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 0.1, y = 2250, label = paste("median: ", EstimatedReturnMedian), colour = "red")
plt_TotalCreditLinespast7years_MedianLine <- plt_TotalCreditLinespast7years + geom_vline(aes(xintercept = TotalCreditLinespast7yearsMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 30, y = 3500, label = paste("median: ", TotalCreditLinespast7yearsMedian), colour = "red")
plt_TotalInquiries_MedianLine <- plt_TotalInquiries + geom_vline(aes(xintercept = TotalInquiriesMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 10, y = 15000, label = paste("median: ", TotalInquiriesMedian), colour = "red")
plt_DelinquenciesLast7Years_MedianLine <- plt_DelinquenciesLast7Years + geom_vline(aes(xintercept = DelinquenciesLast7YearsMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 13, y = 70000, label = paste("median: ", DelinquenciesLast7YearsMedian), colour = "red")
plt_TotalTrades_MedianLine <- plt_TotalTrades + geom_vline(aes(xintercept = TotalTradesMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 35, y = 3500, label = paste("median: ", TotalTradesMedian), colour = "red")
plt_DebtToIncomeRatio_MedianLine <-plt_DebtToIncomeRatio + geom_vline(aes(xintercept = DebtToIncomeRatioMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 0.4, y = 3500, label = paste("median: ", DebtToIncomeRatioMedian), colour = "red")
plt_TotalProsperPaymentsBilled_MedianLine <- plt_TotalProsperPaymentsBilled + geom_vline(aes(xintercept = TotalProsperPaymentsBilledMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 35, y = 1500, label = paste("median: ", TotalProsperPaymentsBilledMedian), colour = "red")
plt_ProsperPrincipalBorrowed_MedianLine <- ggplot(aes(ProsperPrincipalBorrowed), data = subset(pld.ana, !is.na(ProsperPrincipalBorrowed))) +
geom_histogram(binwidth = 500)+
scale_x_continuous(limits = c(0, 70000), breaks = seq(0, 70000, 10000)) +
geom_vline(aes(xintercept = ProsperPrincipalBorrowedMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 10000, y = 2000, label = paste("median: ", ProsperPrincipalBorrowedMedian), colour = "red")
plt_ProsperPrincipalOutstanding_MedianLine <- ggplot(aes(ProsperPrincipalOutstanding), data = subset(pld.ana, !is.na(ProsperPrincipalOutstanding))) +
geom_histogram(binwidth = 500) +
scale_x_continuous(limits = c(0, 20000), breaks = seq(0, 20000,2000)) +
geom_vline(aes(xintercept = ProsperPrincipalOutstandingMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 4500, y = 7000, label = paste("median: ", ProsperPrincipalOutstandingMedian), colour = "red")
plt_LoanOriginalAmount_MedianLine <- ggplot(aes(LoanOriginalAmount), data = subset(pld.ana, !is.na(LoanOriginalAmount))) +
geom_histogram(binwidth = 500) +
scale_x_continuous(limits = c(1000, 35000), breaks = seq(1000, 35000, 5000)) +
geom_vline(aes(xintercept = LoanOriginalAmountMedian), col = "royalblue", lwd = 1) +
annotate("text", x = 9000, y = 10000, label = paste("median: ", LoanOriginalAmountMedian), colour = "red")
plt_MonthlyLoanPayment_MedianLine <- ggplot(aes(MonthlyLoanPayment), data = subset(pld.ana, !is.na(MonthlyLoanPayment))) +
geom_histogram(binwidth = 50) +
scale_x_continuous(limits = c(0, 2000), breaks = seq(0, 2000, 400)) +
geom_vline(aes(xintercept = MonthlyLoanPaymentMedian), col = "royalblue", lwd = 1 ) +
annotate("text", x = 380, y = 12000, label = paste("median: ", MonthlyLoanPaymentMedian), colour = "red")
grid.arrange(plt_EstimatedReturn_MedianLine, plt_TotalCreditLinespast7years_MedianLine, plt_TotalInquiries_MedianLine, plt_DelinquenciesLast7Years_MedianLine,
plt_TotalTrades_MedianLine, plt_DebtToIncomeRatio_MedianLine, plt_TotalProsperPaymentsBilled_MedianLine,
plt_ProsperPrincipalBorrowed_MedianLine, plt_ProsperPrincipalOutstanding_MedianLine, plt_LoanOriginalAmount_MedianLine, plt_MonthlyLoanPayment_MedianLine, ncol = 4)
par(mfrow = c(3, 2), mar = c(4, 13, 2, 2))
# bar chart of IncomeRange
barplot(table(pld.ana$IncomeRange), horiz = TRUE, las = 2)
title(main = "IncomeRange", cex.main = 2)
# bar chart of Occupation
barplot(table(pld.Occupation$Occupation), horiz = TRUE, las = 2)
title(main = "Occupations", cex.main = 2)
# bar chart of IsBorrowerHomeowner
barplot(table(pld.ana$IsBorrowerHomeowner), horiz = TRUE, las = 2)
title(main = "IsBorrowerHomeowner", cex.main = 2)
# bar chart of ProsperScore
barplot(table(pld.ana$ProsperScore), horiz = TRUE, las = 2)
title(main = "ProsperScore", cex.main = 2)
The income ranges $25,000-49,999 and $50,000-74,999 are the most frequent, at around 30000 observations each. The selected occupations are roughly equally represented, as are the two IsBorrowerHomeowner categories. ProsperScore values of 4, 6 and 8 are the most common.
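How evenly a categorical variable is split is easier to judge from proportions than from raw counts. A minimal sketch using the IsBorrowerHomeowner counts reported by `summary(pld.ana)`:

```r
# Counts taken from the summary output above
homeowner <- c(False = 56459, True = 57478)

# Convert counts to shares of the total
shares <- prop.table(homeowner)
round(shares, 3)   # False 0.496, True 0.504
```

On the data frame itself, `prop.table(table(pld.ana$IsBorrowerHomeowner))` would give the same shares.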
pld.IncomeRange <- subset(pld.ana, IncomeRange %in% c("$1-24,999", "$25,000-49,999", "$50,000-74,999", "$75,000-99,999", "$100,000+"))
# EstimatedReturn by IncomeRange
ggplot(aes(EstimatedReturn), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange), binwidth = 0.005) +
xlim(0, 0.25)
by(pld.ana$EstimatedReturn, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.0166 0.1132 0.1243 0.1157 0.1360 0.1698 576
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.1656 0.0878 0.1123 0.1092 0.1271 0.2265 2620
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.1827 0.0672 0.0827 0.0879 0.1074 0.2667 2132
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.182 0.080 0.101 0.102 0.124 0.257 8017
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.168 0.074 0.090 0.095 0.114 0.284 5423
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.0868 0.0718 0.0872 0.0920 0.1112 0.2570 2418
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 7741
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.0105 0.1110 0.1221 0.1194 0.1358 0.2265 157
Among the income ranges, $25,000-49,999 and $50,000-74,999 are the most frequent, followed by $75,000-99,999 and $100,000+. In the plot, the blue line ($50,000-74,999) and the green line ($25,000-49,999) have higher counts than the other ranges, but they diverge once the return exceeds about 0.08: from there on the green line has higher counts than the blue line.
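The `by()` output lists the income ranges in the factor's default level order, which places $100,000+ between $1-24,999 and $25,000-49,999. Setting the levels explicitly puts grouped summaries in income order. The sketch below uses made-up returns, and the names `income_levels`, `income_ord` and `median_by_income` are assumptions for illustration.

```r
# Made-up observations: income range label and estimated return
income <- c("$25,000-49,999", "$100,000+", "$1-24,999",
            "$50,000-74,999", "$75,000-99,999", "$25,000-49,999")
ret    <- c(0.101, 0.083, 0.112, 0.090, 0.087, 0.105)

# Levels listed from lowest to highest income
income_levels <- c("$1-24,999", "$25,000-49,999", "$50,000-74,999",
                   "$75,000-99,999", "$100,000+")
income_ord <- factor(income, levels = income_levels, ordered = TRUE)

# Median return per income range, reported in level order
median_by_income <- tapply(ret, income_ord, median)
median_by_income
```

The same ordered factor would also put the legend of the frequency polygons in income order.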
# TotalCreditLinespast7years by IncomeRange
ggplot(aes(TotalCreditLinespast7years), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange))
by(pld.ana$TotalCreditLinespast7years, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 11.00 21.00 22.97 31.00 101.00 3
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 9.00 15.00 17.55 24.00 101.00 7
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 23.00 32.00 33.12 41.00 136.00 1
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 15.00 22.00 23.48 30.00 118.00 6
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 19.00 26.00 27.99 35.00 107.00 1
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 21.00 29.00 30.48 38.00 124.00
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 12.00 20.00 22.24 29.00 127.00 677
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 11.00 18.00 20.26 27.00 71.00 2
The green line ($25,000-49,999) has two modes for total credit lines, at around 13 and 25. The second is also the mode of every other range except the red line ($1-24,999).
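Modes read off a frequency polygon can be cross-checked by looking for local maxima in the binned counts. The helper `find_peaks` below is hypothetical and the sample is made up; the result also depends on the bin width chosen.

```r
# Hypothetical helper: indices of bins whose count exceeds both neighbours
find_peaks <- function(counts) {
  n <- length(counts)
  if (n < 3) return(integer(0))
  inner <- 2:(n - 1)
  inner[counts[inner] > counts[inner - 1] & counts[inner] > counts[inner + 1]]
}

# Made-up sample with clusters near 13 and 25
x <- c(11, 13, 13, 13, 13, 15, 17, 23, 25, 25, 25, 25, 25, 27, 29)

# Bin the values without drawing a plot, then locate the peaks
h <- hist(x, breaks = seq(10, 30, by = 2), plot = FALSE)
peaks <- h$mids[find_peaks(h$counts)]
peaks   # bin midpoints 13 and 25
```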
# TotalInquiries by IncomeRange
ggplot(aes(TotalInquiries), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange)) +
xlim(0, 10)
by(pld.ana$TotalInquiries, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 5.000 7.836 10.000 78.000
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 4.166 5.000 112.000
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 5.000 6.045 8.000 90.000
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 4.845 6.000 117.000
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 4.000 5.318 7.000 109.000
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 4.000 5.567 7.000 158.000
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 4.00 8.00 10.95 14.00 379.00 1159
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 3.087 4.000 29.000
In this plot, all the ranges follow similar trends.
# TotalTrades by IncomeRange
ggplot(aes(TotalTrades), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange), binwidth = 5) +
coord_cartesian(xlim = c(0, 100))
by(pld.ana$TotalTrades, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 17.00 19.03 26.00 86.00
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 14.27 19.00 83.00
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 20.00 28.00 29.35 37.00 126.00
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 12.00 18.00 19.73 26.00 91.00
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 16.00 23.00 24.03 30.00 103.00
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 18.00 25.00 26.51 33.00 122.00
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 11.00 18.00 19.45 26.75 65.00 7543
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 8.00 14.00 16.26 21.00 65.00 1
The mode of the green line ($25,000-49,999) sits at fewer trades than that of the other ranges, except the red line.
# DebtToIncomeRatio by IncomeRange
ggplot(aes(DebtToIncomeRatio), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange), binwidth = 0.2) +
coord_cartesian(xlim = c(0, 1.5))
by(pld.ana$DebtToIncomeRatio, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 621
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.020 0.190 0.320 0.737 0.500 10.010 913
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1200 0.1700 0.1806 0.2300 10.0100 1266
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1700 0.2600 0.2789 0.3600 7.9000 2311
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1600 0.2300 0.2457 0.3200 10.0100 1690
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.1400 0.2000 0.2137 0.2800 2.5500 901
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.090 0.160 0.297 0.260 10.010 124
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.010 0.160 0.295 3.328 10.010 10.010 728
All the ranges have the same mode, but the green line ($25,000-49,999) falls off more slowly than the blue line ($50,000-74,999).
# TotalProsperPaymentsBilled by IncomeRange
ggplot(aes(TotalProsperPaymentsBilled), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange), binwidth = 5) +
coord_cartesian(xlim = c(0, 100))
by(pld.ana$TotalProsperPaymentsBilled, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 14.00 18.39 21.50 81.00 542
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.00 16.00 22.29 32.00 116.00 6029
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.00 16.00 22.92 33.00 133.00 13335
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 9.0 16.0 22.3 33.0 131.0 25808
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.00 16.00 23.43 34.00 141.00 24521
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.00 16.00 23.63 34.00 128.00 13198
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 7741
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 9.00 13.00 18.41 27.00 70.00 678
All ranges here have similar trends.
# ProsperPrincipalBorrowed by IncomeRange
ggplot(aes(ProsperPrincipalBorrowed), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange), binwidth = 1000) +
coord_cartesian(xlim = c(0, 40000))
by(pld.ana$ProsperPrincipalBorrowed, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 2512 5000 8374 10000 40000 542
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 2000 4000 5147 6400 37000 6029
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 5000 10000 12230 15500 72500 13335
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3000 5000 6449 8000 67000 25808
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 3500 6000 8094 10400 65000 24521
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 4000 7500 9791 13500 55900 13198
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 7741
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1000 2475 4000 5185 6525 23600 678
Here the green line ($25,000-49,999) behaves differently from the other ranges: once the principal borrowed exceeds 20000, every other range flattens out, while the green line stays noisy.
# ProsperPrincipalOutstanding by IncomeRange
ggplot(aes(ProsperPrincipalOutstanding), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange)) +
coord_cartesian(xlim = c(0, 15000))
by(pld.ana$ProsperPrincipalOutstanding, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 941.8 2453.0 4068.0 13280.0 542
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1005 1852 2767 20320 6029
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 2454 4083 6445 23260 13335
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1478 2375 3405 21590 25808
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1659 2794 3956 23030 24521
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1882 3280 5096 23450 13198
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 7741
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1147 2149 3501 13740 678
# LoanOriginalAmount by IncomeRange
ggplot(aes(LoanOriginalAmount), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange))
by(pld.ana$LoanOriginalAmount, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 5000 7411 10000 25000
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2052 4000 4274 5000 25000
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 6000 12000 13070 18500 35000
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3000 5000 6178 9800 25000
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 8675 13500 25000
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 9700 10370 15000 25000
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2100 3033 5170 6001 25000
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4000 4885 6000 25000
# MonthlyLoanPayment by IncomeRange
ggplot(aes(MonthlyLoanPayment), data = pld.IncomeRange) +
geom_freqpoly(aes(color = IncomeRange)) +
coord_cartesian(xlim = c(0, 1300))
by(pld.ana$MonthlyLoanPayment, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 87.14 169.70 267.50 347.60 1131.00
## --------------------------------------------------------
## pld.ana$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 86.38 134.30 154.70 173.70 1048.00
## --------------------------------------------------------
## pld.ana$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 208.6 375.0 412.2 560.1 2252.0
## --------------------------------------------------------
## pld.ana$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 118.9 173.7 210.4 282.8 1382.0
## --------------------------------------------------------
## pld.ana$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 155.8 253.1 280.3 383.7 1778.0
## --------------------------------------------------------
## pld.ana$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 169.5 301.6 329.4 457.2 2112.0
## --------------------------------------------------------
## pld.ana$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 76.62 122.50 182.40 217.80 1048.00
## --------------------------------------------------------
## pld.ana$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 90.28 169.60 183.80 217.70 1086.00
The green line ($25,000-49,999) drops off rapidly after its mode. In the plots above, its distribution looks relatively different from those of the other ranges, so I will assume that the $25,000-49,999 income range has more influence than the other ranges.
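One practical note on the summaries above: `by()` prints the income ranges in the factor's default order, so `$100,000+` lands between `$1-24,999` and `$25,000-49,999`. The following is a minimal sketch of re-levelling the factor so that tables and legends read from low to high income. The small `income` vector is made up for illustration; only the level labels come from the output above, and the position of the two non-numeric categories is an arbitrary choice.

```r
# Hypothetical sample of IncomeRange values (labels taken from the by() output)
income <- factor(c("$100,000+", "$1-24,999", "Not employed", "$25,000-49,999",
                   "$0", "$75,000-99,999", "$50,000-74,999", "Not displayed"))

# Desired order: non-numeric categories first, then ascending income
income_levels <- c("Not displayed", "Not employed", "$0", "$1-24,999",
                   "$25,000-49,999", "$50,000-74,999", "$75,000-99,999",
                   "$100,000+")

# Rebuild the factor with explicit levels; the values themselves are unchanged
income_ordered <- factor(income, levels = income_levels)
levels(income_ordered)
```

Applied to the real data, the same call would be `factor(pld.ana$IncomeRange, levels = income_levels)`, assuming the labels match exactly.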
# EstimatedReturn by Occupation
bplt_EstimatedReturn_by_Occupation <- ggplot(aes(x = Occupation, y = EstimatedReturn), data = subset(pld.Occupation, !is.na(EstimatedReturn))) +
geom_boxplot() +
coord_cartesian(ylim = c(0.05, 0.15))
bplt_EstimatedReturn_by_Occupation
pld.Occupation.EstimatedReturn_by_Occupation <- subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(Occupation)) %>%
group_by(Occupation) %>%
summarise(mean_EstimatedReturn = mean(EstimatedReturn),
median_EstimatedReturn = median(EstimatedReturn),
n = n()) %>%
arrange(Occupation)
pld.Occupation.EstimatedReturn_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_EstimatedReturn median_EstimatedReturn
## <fctr> <dbl> <dbl>
## 1 Accountant/CPA 0.09380908 0.08922
## 2 Administrative Assistant 0.10391261 0.10423
## 3 Computer Programmer 0.08935118 0.08360
## 4 Executive 0.09055953 0.08529
## 5 Sales - Commission 0.09614054 0.09130
## 6 Teacher 0.09582725 0.09130
## # ... with 1 more variables: n <int>
Administrative Assistant has the highest mean and median, with the other occupations slightly lower.
# EstimatedReturn by ProsperScore
bplt_EstimatedReturn_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = EstimatedReturn), data = subset(pld.ana, !is.na(EstimatedReturn) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(0.025, 0.15))
bplt_EstimatedReturn_by_ProsperScore
pld.EstimatedReturn_by_ProsperScore <- subset(pld.ana, !is.na(EstimatedReturn) & !is.na(ProsperScore)) %>%
group_by(ProsperScore) %>%
summarise(mean_EstimatedReturn = mean(EstimatedReturn),
median_EstimatedReturn = median(EstimatedReturn),
n = n()) %>%
arrange(ProsperScore)
pld.EstimatedReturn_by_ProsperScore
## # A tibble: 11 x 4
## ProsperScore mean_EstimatedReturn median_EstimatedReturn n
## <fctr> <dbl> <dbl> <int>
## 1 1 0.10513663 0.124600 992
## 2 2 0.10956012 0.110700 5766
## 3 3 0.10455803 0.104065 7642
## 4 4 0.10111778 0.096090 12595
## 5 5 0.10792146 0.108700 9813
## 6 6 0.10490942 0.100100 12278
## 7 7 0.09905789 0.089220 10597
## 8 8 0.08703009 0.078240 12053
## 9 9 0.07571866 0.070100 6911
## 10 10 0.06149748 0.057820 4750
## 11 11 0.05621365 0.053420 1456
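Generally, the return decreases as the Prosper score gets higher: the mean stays between roughly 0.10 and 0.11 for scores 1 to 6 and then falls steadily to about 0.056 at score 11.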
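As a rough check of how consistent this downward trend is, the group means from the table above can be compared with the scores using a rank correlation. This is only a sketch on the 11 printed means, not on the individual loans, so it describes the trend of the averages and nothing more.

```r
# Group means copied from the table above (ProsperScore 1 to 11)
score <- 1:11
mean_return <- c(0.10513663, 0.10956012, 0.10455803, 0.10111778, 0.10792146,
                 0.10490942, 0.09905789, 0.08703009, 0.07571866, 0.06149748,
                 0.05621365)

# Spearman's rho: how consistently the means fall as the score rises
rho <- cor(score, mean_return, method = "spearman")

# Positions where the mean goes up from one score to the next
rises <- which(diff(mean_return) > 0)

rho
rises
```

Here `rho` comes out at about -0.88, and `rises` shows that the mean goes up only after scores 1 and 4, so the decline is strong but not perfectly monotonic.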
# TotalCreditLinespast7years by Occupation
bplt_TotalCreditLinepast7years_by_Occupation <- ggplot(aes(x = Occupation, y = TotalCreditLinespast7years), data = subset(pld.Occupation, !is.na(TotalCreditLinespast7years))) +
geom_boxplot() +
coord_cartesian(ylim = c(10, 45))
bplt_TotalCreditLinepast7years_by_Occupation
pld.Occupation.TotalCreditLinespast7years_by_Occupation <- subset(pld.Occupation, !is.na(TotalCreditLinespast7years)) %>%
group_by(Occupation) %>%
summarise(mean_TotalCreditLinespast7years = mean(TotalCreditLinespast7years),
median_TotalCreditLinespast7years = median(TotalCreditLinespast7years),
n = n()) %>%
arrange(Occupation)
pld.Occupation.TotalCreditLinespast7years_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_TotalCreditLinespast7years
## <fctr> <dbl>
## 1 Accountant/CPA 30.43007
## 2 Administrative Assistant 26.00000
## 3 Computer Programmer 25.51665
## 4 Executive 31.55731
## 5 Sales - Commission 26.57342
## 6 Teacher 30.90205
## # ... with 2 more variables: median_TotalCreditLinespast7years <dbl>,
## # n <int>
Executive and Teacher have the highest median. Judging by the means shown above, Computer Programmer and Administrative Assistant are the lowest (the median column is truncated in the printed output).
# TotalCreditLinespast7years by ProsperScore
bplt_TotalCreditLinepast7years_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = TotalCreditLinespast7years), data = subset(pld.ana, !is.na(TotalCreditLinespast7years) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(10, 45))
bplt_TotalCreditLinepast7years_by_ProsperScore
pld.TotalCreditLinespast7years_by_ProsperScore <- subset(pld.ana, !is.na(TotalCreditLinespast7years)) %>%
group_by(ProsperScore) %>%
summarise(mean_TotalCreditLinespast7years = mean(TotalCreditLinespast7years),
median_TotalCreditLinespast7years = median(TotalCreditLinespast7years),
n = n()) %>%
arrange(ProsperScore)
pld.TotalCreditLinespast7years_by_ProsperScore
## # A tibble: 12 x 4
## ProsperScore mean_TotalCreditLinespast7years
## <fctr> <dbl>
## 1 1 34.44052
## 2 2 28.73014
## 3 3 28.58872
## 4 4 27.85907
## 5 5 27.61796
## 6 6 27.04781
## 7 7 27.02850
## 8 8 27.08322
## 9 9 26.74591
## 10 10 27.97663
## 11 11 30.14629
## 12 NA 24.05731
## # ... with 2 more variables: median_TotalCreditLinespast7years <dbl>,
## # n <int>
As the Prosper score gets higher, the boxplots trace a roughly U-shaped (quadratic-looking) curve: the values decrease first and then increase.
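One way to make this shape concrete is to fit a quadratic to the group means printed above and look at the sign of the squared term. This is a sketch on the 11 printed means only (the `NA` row is left out), and it treats the score as a number rather than a factor, which is an assumption made for illustration.

```r
# Group means copied from the table above (ProsperScore 1 to 11; NA row omitted)
score <- 1:11
mean_lines <- c(34.44052, 28.73014, 28.58872, 27.85907, 27.61796, 27.04781,
                27.02850, 27.08322, 26.74591, 27.97663, 30.14629)

# Fit mean = a + b * score + c * score^2; a positive c indicates a U shape
fit <- lm(mean_lines ~ score + I(score^2))
quad_coef <- unname(coef(fit)["I(score^2)"])

# Score at which the fitted parabola reaches its minimum: -b / (2c)
vertex <- unname(-coef(fit)["score"] / (2 * quad_coef))

quad_coef
vertex
```

A positive `quad_coef` with a `vertex` inside the 1 to 11 range is consistent with the decrease-then-increase pattern described above.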
# TotalInquiries by Occupation
bplt_TotalInquiries_by_Occupation <- ggplot(aes(x = Occupation, y = TotalInquiries), data = subset(pld.Occupation, !is.na(TotalInquiries))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 10))
bplt_TotalInquiries_by_Occupation
pld.Occupation.TotalInquiries_by_Occupation <- subset(pld.Occupation, !is.na(TotalInquiries)) %>%
group_by(Occupation) %>%
summarise(mean_TotalInquiries = mean(TotalInquiries),
median_TotalInquiries = median(TotalInquiries),
n = n()) %>%
arrange(Occupation)
pld.Occupation.TotalInquiries_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_TotalInquiries median_TotalInquiries n
## <fctr> <dbl> <dbl> <int>
## 1 Accountant/CPA 5.617074 4 3233
## 2 Administrative Assistant 5.460266 4 3687
## 3 Computer Programmer 5.674408 4 4478
## 4 Executive 6.264208 5 4311
## 5 Sales - Commission 6.499420 4 3446
## 6 Teacher 5.166002 3 3759
Executive has the highest median (5) and Teacher the lowest (3); the other occupations all have a median of 4.
# TotalInquiries by ProsperScore
bplt_TotalInquiries_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = TotalInquiries), data = subset(pld.ana, !is.na(TotalInquiries) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 15))
bplt_TotalInquiries_by_ProsperScore
pld.TotalInquiries_by_ProsperScore <- subset(pld.ana, !is.na(TotalInquiries)) %>%
group_by(ProsperScore) %>%
summarise(mean_TotalInquiries = mean(TotalInquiries),
median_TotalInquiries = median(TotalInquiries),
n = n()) %>%
arrange(ProsperScore)
pld.TotalInquiries_by_ProsperScore
## # A tibble: 12 x 4
## ProsperScore mean_TotalInquiries median_TotalInquiries n
## <fctr> <dbl> <dbl> <int>
## 1 1 10.718750 9 992
## 2 2 5.834374 5 5766
## 3 3 5.273619 4 7642
## 4 4 4.802541 4 12595
## 5 5 4.301743 3 9813
## 6 6 3.990878 3 12278
## 7 7 3.823535 3 10597
## 8 8 3.482370 3 12053
## 9 9 3.479959 3 6911
## 10 10 3.302526 3 4750
## 11 11 3.811126 3 1456
## 12 NA 9.516383 7 27925
Total inquiries decrease as the score increases, and the drop from score 1 to score 2 (mean 10.7 to 5.8) is much bigger than any difference among scores 2-11.
# DelinquenciesLast7Years by Occupation
bplt_DelinquenciesLast7Years_by_Occupation <- ggplot(aes(x = Occupation, y = DelinquenciesLast7Years), data = subset(pld.Occupation, !is.na(DelinquenciesLast7Years))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 5))
bplt_DelinquenciesLast7Years_by_Occupation
by(pld.Occupation$DelinquenciesLast7Years, pld.Occupation$Occupation, summary)
## pld.Occupation$Occupation: Accountant/CPA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 3.955 2.000 99.000 1
## --------------------------------------------------------
## pld.Occupation$Occupation: Administrative Assistant
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.748 4.000 99.000 7
## --------------------------------------------------------
## pld.Occupation$Occupation: Computer Programmer
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 2.831 0.000 99.000 3
## --------------------------------------------------------
## pld.Occupation$Occupation: Executive
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 3.553 1.000 99.000 1
## --------------------------------------------------------
## pld.Occupation$Occupation: Sales - Commission
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.792 4.000 99.000 7
## --------------------------------------------------------
## pld.Occupation$Occupation: Teacher
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.798 4.000 99.000 2
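Every occupation here has a median of 0, but the spread differs: Computer Programmer has the smallest (third quartile 0), followed by Executive (third quartile 1), while Administrative Assistant, Sales - Commission and Teacher have the biggest (third quartile 4).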
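The `by()` summaries above report quartiles, not variance, and for heavily skewed counts the two can rank groups differently. The sketch below uses a small made-up data frame (not the Prosper data) to show how the interquartile range and the standard deviation per group could be computed side by side.

```r
# Hypothetical delinquency counts for two made-up groups (not the Prosper data)
toy <- data.frame(
  group  = rep(c("A", "B"), each = 6),
  delinq = c(0, 0, 0, 0, 0, 12,   # A: mostly zero, one large value
             0, 0, 1, 3, 4, 6)    # B: more evenly spread
)

# Interquartile range and standard deviation per group
iqr_by_group <- tapply(toy$delinq, toy$group, IQR)
sd_by_group  <- tapply(toy$delinq, toy$group, sd)

iqr_by_group
sd_by_group
```

In this toy example group A has the smaller interquartile range but the larger standard deviation, because a single extreme value dominates the variance. The same two `tapply()` calls could be pointed at `pld.Occupation$DelinquenciesLast7Years` and `pld.Occupation$Occupation`, with `na.rm = TRUE` passed through to handle the missing values.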
# DelinquenciesLast7Years by ProsperScore
bplt_DelinquenciesLast7Years_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = DelinquenciesLast7Years), data = subset(pld.ana, !is.na(DelinquenciesLast7Years) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 8))
bplt_DelinquenciesLast7Years_by_ProsperScore
by(pld.ana$DelinquenciesLast7Years, pld.ana$ProsperScore, summary)
## pld.ana$ProsperScore: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 6.846 8.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 4.996 5.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 4.576 4.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 4.255 3.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.934 3.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.807 2.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 3.702 2.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.995 1.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.327 0.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.587 0.000 99.000
## --------------------------------------------------------
## pld.ana$ProsperScore: 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.414 0.000 76.000
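As with several of the variables above, delinquencies get smaller as the score gets higher: the median is 0 throughout, but the mean falls from 6.8 at score 1 to 1.4 at score 11, and the third quartile falls from 8 to 0.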
# TotalTrades by Occupation
bplt_TotalTrades_by_Occupation <- ggplot(aes(x = Occupation, y = TotalTrades), data = subset(pld.Occupation, !is.na(TotalTrades))) +
geom_boxplot() +
coord_cartesian(ylim = c(10, 45))
bplt_TotalTrades_by_Occupation
pld.Occupation.TotalTrades_by_Occupation <- subset(pld.Occupation, !is.na(TotalTrades)) %>%
group_by(Occupation) %>%
summarise(mean_TotalTrades = mean(TotalTrades),
median_TotalTrades = median(TotalTrades),
n = n()) %>%
arrange(Occupation)
pld.Occupation.TotalTrades_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_TotalTrades median_TotalTrades n
## <fctr> <dbl> <dbl> <int>
## 1 Accountant/CPA 25.84001 25 3119
## 2 Administrative Assistant 22.46338 21 3509
## 3 Computer Programmer 22.43504 21 4218
## 4 Executive 28.07990 27 4155
## 5 Sales - Commission 23.35712 22 3195
## 6 Teacher 25.07105 24 3617
Executive has the highest median (27), with Administrative Assistant and Computer Programmer the lowest (21).
# TotalTrades by ProsperScore
bplt_TotalTrades_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = TotalTrades), data = subset(pld.ana, !is.na(TotalTrades) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(12, 45))
bplt_TotalTrades_by_ProsperScore
pld.TotalTrades_by_ProsperScore <- subset(pld.ana, !is.na(TotalTrades)) %>%
group_by(ProsperScore) %>%
summarise(mean_TotalTrades = mean(TotalTrades),
median_TotalTrades = median(TotalTrades),
n = n()) %>%
arrange(ProsperScore)
pld.TotalTrades_by_ProsperScore
## # A tibble: 12 x 4
## ProsperScore mean_TotalTrades median_TotalTrades n
## <fctr> <dbl> <dbl> <int>
## 1 1 29.39113 28 992
## 2 2 24.39716 23 5766
## 3 3 24.47265 23 7642
## 4 4 23.89162 23 12595
## 5 5 23.74330 22 9813
## 6 6 23.29093 22 12278
## 7 7 23.45485 22 10597
## 8 8 23.59379 22 12053
## 9 9 23.52192 22 6911
## 10 10 25.06589 24 4750
## 11 11 26.87981 26 1456
## 12 NA 20.47827 18 21540
Like the TotalCreditLinespast7years variable, the boxplots trace a roughly U-shaped curve as the Prosper score gets higher, decreasing first and then increasing, but with relatively smaller changes.
# DebtToIncomeRatio by Occupation
bplt_DebtToIncomeRatio_by_Occupation <- ggplot(aes(x = Occupation, y = DebtToIncomeRatio), data = subset(pld.Occupation, !is.na(DebtToIncomeRatio))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 0.5))
bplt_DebtToIncomeRatio_by_Occupation
pld.Occupation.DebtToIncomeRatio_by_Occupation <- subset(pld.Occupation, !is.na(DebtToIncomeRatio)) %>%
group_by(Occupation) %>%
summarise(mean_DebtToIncomeRatio = mean(DebtToIncomeRatio),
median_DebtToIncomeRatio = median(DebtToIncomeRatio),
n = n()) %>%
arrange(Occupation)
pld.Occupation.DebtToIncomeRatio_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_DebtToIncomeRatio median_DebtToIncomeRatio
## <fctr> <dbl> <dbl>
## 1 Accountant/CPA 0.2445767 0.22
## 2 Administrative Assistant 0.3018444 0.25
## 3 Computer Programmer 0.2000212 0.18
## 4 Executive 0.2114236 0.18
## 5 Sales - Commission 0.2577339 0.19
## 6 Teacher 0.3046247 0.26
## # ... with 1 more variables: n <int>
Administrative Assistant and Teacher have the highest median, with Computer Programmer and Executive the lowest.
# DebtToIncomeRatio by ProsperScore
bplt_DebtToIncomeRatio_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = DebtToIncomeRatio), data = subset(pld.ana, !is.na(DebtToIncomeRatio) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(0.05, 0.45))
bplt_DebtToIncomeRatio_by_ProsperScore
pld.DebtToIncomeRatio_by_ProsperScore <- subset(pld.ana, !is.na(DebtToIncomeRatio)) %>%
group_by(ProsperScore) %>%
summarise(mean_DebtToIncomeRatio = mean(DebtToIncomeRatio),
median_DebtToIncomeRatio = median(DebtToIncomeRatio),
n = n()) %>%
arrange(ProsperScore)
pld.DebtToIncomeRatio_by_ProsperScore
## # A tibble: 12 x 4
## ProsperScore mean_DebtToIncomeRatio median_DebtToIncomeRatio n
## <fctr> <dbl> <dbl> <int>
## 1 1 0.4275173 0.33 721
## 2 2 0.3178847 0.27 4822
## 3 3 0.3212492 0.28 6580
## 4 4 0.2945494 0.27 11164
## 5 5 0.2896673 0.25 8776
## 6 6 0.2681130 0.24 11309
## 7 7 0.2382009 0.21 9966
## 8 8 0.2156692 0.19 11543
## 9 9 0.1939321 0.17 6625
## 10 10 0.1765553 0.16 4639
## 11 11 0.2006657 0.19 1412
## 12 NA 0.3238720 0.20 27826
DebtToIncomeRatio clearly decreases as the score gets higher, apart from a small rise at score 11.
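To see exactly where the downward trend is interrupted, the medians printed above can be differenced from one score to the next. This is a small sketch on the 11 printed medians (the `NA` row is left out).

```r
# Medians copied from the table above (ProsperScore 1 to 11; NA row omitted)
median_dti <- c(0.33, 0.27, 0.28, 0.27, 0.25, 0.24, 0.21, 0.19, 0.17, 0.16, 0.19)

# Change in the median from each score to the next
step <- diff(median_dti)

# Scores after which the median goes up instead of down
increases_after <- which(step > 0)
increases_after
```

The result points at scores 2 and 10: the median rises slightly from score 2 to 3 and again from score 10 to 11, and falls at every other step.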
# TotalProsperPaymentsBilled by Occupation
bplt_TotalProsperPaymentsBilled_by_Occupation <- ggplot(aes(x = Occupation, y = TotalProsperPaymentsBilled), data = subset(pld.Occupation, !is.na(TotalProsperPaymentsBilled))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 45))
bplt_TotalProsperPaymentsBilled_by_Occupation
pld.Occupation.TotalProsperPaymentsBilled_by_Occupation <- subset(pld.Occupation, !is.na(TotalProsperPaymentsBilled)) %>%
group_by(Occupation) %>%
summarise(mean_TotalProsperPaymentsBilled = mean(TotalProsperPaymentsBilled),
median_TotalProsperPaymentsBilled = median(TotalProsperPaymentsBilled),
n = n()) %>%
arrange(Occupation)
pld.Occupation.TotalProsperPaymentsBilled_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_TotalProsperPaymentsBilled
## <fctr> <dbl>
## 1 Accountant/CPA 22.45644
## 2 Administrative Assistant 23.30863
## 3 Computer Programmer 23.50436
## 4 Executive 21.88734
## 5 Sales - Commission 20.98701
## 6 Teacher 25.18665
## # ... with 2 more variables: median_TotalProsperPaymentsBilled <dbl>,
## # n <int>
All occupations have similar median values.
# TotalProsperPaymentsBilled by ProsperScore
bplt_TotalProsperPaymentsBilled_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = TotalProsperPaymentsBilled), data = subset(pld.ana, !is.na(TotalProsperPaymentsBilled) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(5, 45))
bplt_TotalProsperPaymentsBilled_by_ProsperScore
pld.TotalProsperPaymentsBilled_by_ProsperScore <- subset(pld.ana, !is.na(TotalProsperPaymentsBilled)) %>%
group_by(ProsperScore) %>%
summarise(mean_TotalProsperPaymentsBilled = mean(TotalProsperPaymentsBilled),
median_TotalProsperPaymentsBilled = median(TotalProsperPaymentsBilled),
n = n()) %>%
arrange(ProsperScore)
pld.TotalProsperPaymentsBilled_by_ProsperScore
## # A tibble: 12 x 4
## ProsperScore mean_TotalProsperPaymentsBilled
## <fctr> <dbl>
## 1 1 24.26814
## 2 2 22.99799
## 3 3 23.15956
## 4 4 24.36944
## 5 5 23.85575
## 6 6 23.56653
## 7 7 23.51542
## 8 8 23.96136
## 9 9 24.11320
## 10 10 26.87716
## 11 11 30.41354
## 12 NA 11.08566
## # ... with 2 more variables: median_TotalProsperPaymentsBilled <dbl>,
## # n <int>
The values do not differ much across scores, although the mean rises somewhat at scores 10 and 11.
# ProsperPrincipalBorrowed by Occupation
bplt_ProsperPrincipalBorrowed_by_Occupation <- ggplot(aes(x = Occupation, y = ProsperPrincipalBorrowed), data = subset(pld.Occupation, !is.na(ProsperPrincipalBorrowed))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 15000))
bplt_ProsperPrincipalBorrowed_by_Occupation
pld.Occupation.ProsperPrincipalBorrowed_by_Occupation <- subset(pld.Occupation, !is.na(ProsperPrincipalBorrowed)) %>%
group_by(Occupation) %>%
summarise(mean_ProsperPrincipalBorrowed = mean(ProsperPrincipalBorrowed),
median_ProsperPrincipalBorrowed = median(ProsperPrincipalBorrowed),
n = n()) %>%
arrange(Occupation)
pld.Occupation.ProsperPrincipalBorrowed_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_ProsperPrincipalBorrowed
## <fctr> <dbl>
## 1 Accountant/CPA 9356.625
## 2 Administrative Assistant 7108.740
## 3 Computer Programmer 9547.729
## 4 Executive 11032.300
## 5 Sales - Commission 8523.407
## 6 Teacher 7782.451
## # ... with 2 more variables: median_ProsperPrincipalBorrowed <dbl>,
## # n <int>
Executive has the highest values with Administrative Assistant the lowest.
# ProsperPrincipalBorrowed by ProsperScore
bplt_ProsperPrincipalBorrowed_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = ProsperPrincipalBorrowed), data = subset(pld.ana, !is.na(ProsperPrincipalBorrowed) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 22000))
bplt_ProsperPrincipalBorrowed_by_ProsperScore
pld.ProsperPrincipalBorrowed_by_ProsperScore <- subset(pld.ana, !is.na(ProsperPrincipalBorrowed)) %>%
group_by(ProsperScore) %>%
summarise(mean_ProsperPrincipalBorrowed = mean(ProsperPrincipalBorrowed),
median_ProsperPrincipalBorrowed = median(ProsperPrincipalBorrowed),
n = n()) %>%
arrange(ProsperScore)
pld.ProsperPrincipalBorrowed_by_ProsperScore
## # A tibble: 12 x 4
## ProsperScore mean_ProsperPrincipalBorrowed
## <fctr> <dbl>
## 1 1 6698.976
## 2 2 7436.541
## 3 3 7369.556
## 4 4 7669.675
## 5 5 7765.790
## 6 6 8011.056
## 7 7 8696.934
## 8 8 9117.895
## 9 9 9321.669
## 10 10 11832.318
## 11 11 14536.176
## 12 NA 6012.382
## # ... with 2 more variables: median_ProsperPrincipalBorrowed <dbl>,
## # n <int>
In contrast to the previous trends, the principal borrowed increases as the score increases, and the mean at score 11 is about twice the mean at score 1.
# ProsperPrincipalOutstanding by Occupation
bplt_ProsperPrincipalOutstanding_by_Occupation <- ggplot(aes(x = Occupation, y = ProsperPrincipalOutstanding), data = subset(pld.Occupation, !is.na(ProsperPrincipalOutstanding))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 7500))
bplt_ProsperPrincipalOutstanding_by_Occupation
pld.Occupation.ProsperPrincipalOutstanding_by_Occupation <- subset(pld.Occupation, !is.na(ProsperPrincipalOutstanding)) %>%
group_by(Occupation) %>%
summarise(mean_ProsperPrincipalOutstanding = mean(ProsperPrincipalOutstanding),
median_ProsperPrincipalOutstanding = median(ProsperPrincipalOutstanding),
n = n()) %>%
arrange(Occupation)
pld.Occupation.ProsperPrincipalOutstanding_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_ProsperPrincipalOutstanding
## <fctr> <dbl>
## 1 Accountant/CPA 3179.139
## 2 Administrative Assistant 2750.100
## 3 Computer Programmer 2721.382
## 4 Executive 4171.374
## 5 Sales - Commission 3040.752
## 6 Teacher 2968.234
## # ... with 2 more variables: median_ProsperPrincipalOutstanding <dbl>,
## # n <int>
As usual, Executive has the highest median, but now its values are much higher than those of the other occupations, which are similar to one another.
# ProsperPrincipalOutstanding by ProsperScore
bplt_ProsperPrincipalOutstanding_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = ProsperPrincipalOutstanding), data = subset(pld.ana, !is.na(ProsperPrincipalOutstanding) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 8000))
bplt_ProsperPrincipalOutstanding_by_ProsperScore
pld.ProsperPrincipalOutstanding_by_ProsperScore <- subset(pld.ana, !is.na(ProsperPrincipalOutstanding)) %>%
group_by(ProsperScore) %>%
summarise(mean_ProsperPrincipalOutstanding = mean(ProsperPrincipalOutstanding),
median_ProsperPrincipalOutstanding = median(ProsperPrincipalOutstanding),
n = n()) %>%
arrange(ProsperScore)
pld.ProsperPrincipalOutstanding_by_ProsperScore
## # A tibble: 12 x 4
## ProsperScore mean_ProsperPrincipalOutstanding
## <fctr> <dbl>
## 1 1 2153.677
## 2 2 2999.903
## 3 3 2925.260
## 4 4 2827.027
## 5 5 2892.634
## 6 6 2777.285
## 7 7 3069.913
## 8 8 2867.231
## 9 9 3009.148
## 10 10 3024.799
## 11 11 3590.115
## 12 NA 3027.456
## # ... with 2 more variables: median_ProsperPrincipalOutstanding <dbl>,
## # n <int>
# LoanOriginalAmount by Occupation
bplt_LoanOriginalAmount_by_Occupation <- ggplot(aes(x = Occupation, y = LoanOriginalAmount), data = subset(pld.Occupation, !is.na(LoanOriginalAmount))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 18000))
bplt_LoanOriginalAmount_by_Occupation
pld.Occupation.LoanOriginalAmount_by_Occupation <- subset(pld.Occupation, !is.na(LoanOriginalAmount)) %>%
group_by(Occupation) %>%
summarise(mean_LoanOriginalAmount = mean(LoanOriginalAmount),
median_LoanOriginalAmount = median(LoanOriginalAmount),
n = n()) %>%
arrange(Occupation)
pld.Occupation.LoanOriginalAmount_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_LoanOriginalAmount
## <fctr> <dbl>
## 1 Accountant/CPA 9195.888
## 2 Administrative Assistant 6598.894
## 3 Computer Programmer 9420.892
## 4 Executive 11890.577
## 5 Sales - Commission 8763.208
## 6 Teacher 7887.450
## # ... with 2 more variables: median_LoanOriginalAmount <dbl>, n <int>
Executive has the highest values with Administrative Assistant the lowest.
# LoanOriginalAmount by ProsperScore
bplt_LoanOriginalAmount_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = LoanOriginalAmount), data = subset(pld.ana, !is.na(LoanOriginalAmount) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 20000))
bplt_LoanOriginalAmount_by_ProsperScore
pld.LoanOriginalAmount_by_ProsperScore <- subset(pld.ana, !is.na(LoanOriginalAmount)) %>%
group_by(ProsperScore) %>%
summarise(mean_LoanOriginalAmount = mean(LoanOriginalAmount),
median_LoanOriginalAmount = median(LoanOriginalAmount),
n = n()) %>%
arrange(ProsperScore)
pld.LoanOriginalAmount_by_ProsperScore
## # A tibble: 12 x 4
## ProsperScore mean_LoanOriginalAmount median_LoanOriginalAmount n
## <fctr> <dbl> <dbl> <int>
## 1 1 4570.955 4000 992
## 2 2 5279.778 4000 5766
## 3 3 7062.552 4500 7642
## 4 4 8401.920 7500 12595
## 5 5 8400.081 7000 9813
## 6 6 9222.604 8000 12278
## 7 7 10097.153 9500 10597
## 8 8 10487.978 10000 12053
## 9 9 10055.976 8300 6911
## 10 10 11742.895 10000 4750
## 11 11 14858.186 15000 1456
## 12 NA 6159.303 4500 29084
There is a huge gap between score 1 and score 11. The original amount rises quickly as the score gets higher, so that at score 11 it is more than 3 times the value at score 1.
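The size of this gap can be read directly off the table above. As a worked check, the ratios of score 11 to score 1 are computed below from the printed mean and median values.

```r
# Values copied from the table above
mean_score1    <- 4570.955
mean_score11   <- 14858.186
median_score1  <- 4000
median_score11 <- 15000

# Ratio of score 11 to score 1
ratio_mean   <- mean_score11 / mean_score1      # about 3.25
ratio_median <- median_score11 / median_score1  # 3.75

ratio_mean
ratio_median
```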
# MonthlyLoanPayment by Occupation
bplt_MonthlyLoanPayment_by_Occupation <- ggplot(aes(x = Occupation, y = MonthlyLoanPayment), data = subset(pld.Occupation, !is.na(MonthlyLoanPayment))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 600))
bplt_MonthlyLoanPayment_by_Occupation
pld.Occupation.MonthlyLoanPayment_by_Occupation <- subset(pld.Occupation, !is.na(MonthlyLoanPayment)) %>%
group_by(Occupation) %>%
summarise(mean_MonthlyLoanPayment = mean(MonthlyLoanPayment),
median_MonthlyLoanPayment = median(MonthlyLoanPayment),
n = n()) %>%
arrange(Occupation)
pld.Occupation.MonthlyLoanPayment_by_Occupation
## # A tibble: 6 x 4
## Occupation mean_MonthlyLoanPayment
## <fctr> <dbl>
## 1 Accountant/CPA 297.6405
## 2 Administrative Assistant 224.3001
## 3 Computer Programmer 306.5021
## 4 Executive 378.6373
## 5 Sales - Commission 287.7390
## 6 Teacher 255.0617
## # ... with 2 more variables: median_MonthlyLoanPayment <dbl>, n <int>
As above, Executive has the highest values with Administrative Assistant the lowest.
# MonthlyLoanPayment by ProsperScore
bplt_MonthlyLoanPayment_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = MonthlyLoanPayment), data = subset(pld.ana, !is.na(MonthlyLoanPayment) & !is.na(ProsperScore))) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 600))
bplt_MonthlyLoanPayment_by_ProsperScore
pld.MonthlyLoanPayment_by_ProsperScore <- subset(pld.ana, !is.na(MonthlyLoanPayment)) %>%
group_by(ProsperScore) %>%
summarise(mean_MonthlyLoanPayment = mean(MonthlyLoanPayment),
median_MonthlyLoanPayment = median(MonthlyLoanPayment),
n = n()) %>%
arrange(ProsperScore)
pld.MonthlyLoanPayment_by_ProsperScore
## # A tibble: 12 x 4
## ProsperScore mean_MonthlyLoanPayment median_MonthlyLoanPayment n
## <fctr> <dbl> <dbl> <int>
## 1 1 194.5868 171.410 992
## 2 2 201.7637 166.540 5766
## 3 3 251.5436 174.200 7642
## 4 4 283.4173 252.670 12595
## 5 5 282.8980 246.410 9813
## 6 6 299.7858 270.425 12278
## 7 7 316.4242 290.180 10597
## 8 8 319.9872 287.110 12053
## 9 9 295.2916 251.700 6911
## 10 10 336.3999 309.120 4750
## 11 11 424.0375 402.315 1456
## 12 NA 215.7157 153.800 29084
As with LoanOriginalAmount, there is a large gap between score 1 and score 11: the monthly payment at score 11 is a bit more than twice that at score 1 (mean 424.04 vs. 194.59, median 402.32 vs. 171.41).
In summary, Administrative Assistant and Executive are the occupations that stand out most, and the variables generally change monotonically (upward or downward) with the Prosper score.
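The group summaries in this section repeat the same filter, group and summarise pattern for every variable. A small helper could factor that out. The sketch below is written in base R so it has no package dependencies; `summarise_by` is a hypothetical name, and the `toy` data frame is made up to show the behaviour, including how rows with a missing group or value are dropped.

```r
# Hypothetical helper: mean, median and count of `value` within each `group`,
# ignoring rows where either column is NA
summarise_by <- function(df, group, value) {
  keep <- !is.na(df[[group]]) & !is.na(df[[value]])
  g <- droplevels(factor(df[[group]][keep]))
  x <- df[[value]][keep]
  data.frame(
    group  = levels(g),
    mean   = as.vector(tapply(x, g, mean)),
    median = as.vector(tapply(x, g, median)),
    n      = as.vector(table(g)),
    stringsAsFactors = FALSE
  )
}

# Made-up data: one missing amount and one missing score
toy <- data.frame(
  score  = c(1, 1, 2, 2, 2, NA),
  amount = c(1000, 3000, 2000, NA, 4000, 5000)
)
result <- summarise_by(toy, "score", "amount")
result
```

With the real data, a call such as `summarise_by(pld.ana, "ProsperScore", "LoanOriginalAmount")` would be the equivalent, with the difference that it drops the `NA` score group that appears as row 12 in several of the tables above.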
# scatterplot of TotalCreditLinespast7years and EstimatedReturn
ggplot(aes(x = TotalCreditLinespast7years, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalCreditLinespast7years))) +
geom_point(alpha = 1/20, position = "jitter") +
geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1 ) +
geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)
# correlation between TotalCreditLinespast7years and EstimatedReturn
cor.test(pld$TotalCreditLinespast7years, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$TotalCreditLinespast7years and pld$EstimatedReturn
## t = -10.59, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04304943 -0.02961027
## sample estimates:
## cor
## -0.03633149
The correlation is -0.036, so the linear relationship between EstimatedReturn and TotalCreditLinespast7years is very weak. The scatterplot agrees: the mean, median and quantile lines are almost parallel to the x-axis, and beyond about 75 credit lines they become noisier than before.
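A correlation this small can still come with a tiny p-value because the sample is very large. As a worked check, the t statistic reported by `cor.test()` above can be reproduced from the correlation and the degrees of freedom using the standard formula t = r * sqrt(df) / sqrt(1 - r^2).

```r
r   <- -0.03633149   # sample correlation from the output above
dof <- 84851         # degrees of freedom from the output above (n - 2)

# t statistic for testing H0: true correlation = 0
t_stat <- r * sqrt(dof) / sqrt(1 - r^2)

# Squared correlation: share of variance linearly associated between the two
r_squared <- r^2

t_stat      # about -10.59, matching the printed value
r_squared   # about 0.0013
```

So the test is highly significant, yet the linear association accounts for only about 0.13% of the variance, which fits the nearly flat lines in the scatterplot.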
# scatterplot of TotalInquiries and EstimatedReturn
ggplot(aes(x = TotalInquiries, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalInquiries))) +
geom_point(alpha = 1/20, position = "jitter") +
geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)
# correlation between TotalInquiries and EstimatedReturn
cor.test(pld$TotalInquiries, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$TotalInquiries and pld$EstimatedReturn
## t = 24.491, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07709580 0.09045827
## sample estimates:
## cor
## 0.0837808
The correlation here is 0.084, which is also very small but positive: the linear relationship between EstimatedReturn and TotalInquiries is very weak, but its direction is opposite to that between EstimatedReturn and TotalCreditLinespast7years. In addition, there is a big gap at some point in the plot.
# scatterplot of DelinquenciesLast7Years and EstimatedReturn
ggplot(aes(x = DelinquenciesLast7Years, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(DelinquenciesLast7Years))) +
geom_point(alpha = 1/20, position = "jitter") +
geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)
# correlation between DelinquenciesLast7Years and EstimatedReturn
cor.test(pld$DelinquenciesLast7Years, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$DelinquenciesLast7Years and pld$EstimatedReturn
## t = 27.632, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08776313 0.10110004
## sample estimates:
## cor
## 0.09443582
Most of the points are concentrated at x = 0, where the number of delinquencies is 0. The linear relationship is again weak (0.094), but the points are spread more evenly than in the previous two plots.
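Because so many observations sit at zero, it may help to report the share of zeros and to compare Pearson's correlation with a rank-based one. The helper below is a hypothetical sketch, not part of the original analysis, and the toy vectors are made up for illustration.

```r
# Hypothetical helper: summarise a zero-heavy count variable against a response
zero_heavy_summary <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)   # keep complete pairs only
  x <- x[ok]
  y <- y[ok]
  c(share_zero = mean(x == 0),
    pearson    = cor(x, y, method = "pearson"),
    spearman   = cor(x, y, method = "spearman"))
}

# Toy vectors for illustration only (not taken from the Prosper data)
toy_x <- c(0, 0, 0, 0, 0, 0, 1, 2, 5, NA)
toy_y <- c(0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.20)
res <- zero_heavy_summary(toy_x, toy_y)
print(round(res, 3))
```

On the real data the call would be zero_heavy_summary(pld$DelinquenciesLast7Years, pld$EstimatedReturn).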
# scatterplot of TotalTrades and EstimatedReturn
ggplot(aes(x = TotalTrades, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalTrades))) +
geom_point(alpha = 1/20, position = "jitter") +
geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)
# correlation between TotalTrades and EstimatedReturn
cor.test(pld$TotalTrades, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$TotalTrades and pld$EstimatedReturn
## t = -19.168, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07235914 -0.05896024
## sample estimates:
## cor
## -0.06566265
The relationship between EstimatedReturn and TotalTrades is negative (-0.066), and the points are mostly concentrated in the area where the return is between 0.05 and 0.13 and the number of trades is between 6 and 37.
# scatterplot of DebtToIncomeRatio and EstimatedReturn
ggplot(aes(x = DebtToIncomeRatio, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(DebtToIncomeRatio))) +
geom_point(alpha = 1/20, position = "jitter") +
geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)
# correlation between DebtToIncomeRatio and EstimatedReturn
cor.test(pld$DebtToIncomeRatio, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$DebtToIncomeRatio and pld$EstimatedReturn
## t = 24.387, df = 77555, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.08024760 0.09421615
## sample estimates:
## cor
## 0.08723617
The points are concentrated in a small area, and the two quantile lines slope upward, which suggests that the two variables are positively related. The correlation is 0.087: small, but it indicates a slight linear relationship.
# scatterplot of TotalProsperPaymentsBilled and EstimatedReturn
ggplot(aes(x = TotalProsperPaymentsBilled, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalProsperPaymentsBilled))) +
geom_point(alpha = 1/20, position = "jitter") +
geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)
# correlation between TotalProsperPaymentsBilled and EstimatedReturn
cor.test(pld$TotalProsperPaymentsBilled, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$TotalProsperPaymentsBilled and pld$EstimatedReturn
## t = -3.6879, df = 19795, p-value = 0.0002267
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04011842 -0.01227741
## sample estimates:
## cor
## -0.026203
The points here are less concentrated than in the plots above, and the lines are almost horizontal. As the correlation test shows, the correlation between EstimatedReturn and TotalProsperPaymentsBilled is -0.026, so the two are hardly related.
# scatterplot of ProsperPrincipalBorrowed and EstimatedReturn
ggplot(aes(x = log(ProsperPrincipalBorrowed+1), y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalBorrowed))) +
geom_point(alpha = 1/20, position = "jitter") +
stat_quantile(quantiles = c(0.25, 0.5, 0.75), aes(color = ..quantile..), lwd = 1)
# correlation between ProsperPrincipalBorrowed and EstimatedReturn
cor.test(pld$ProsperPrincipalBorrowed, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$ProsperPrincipalBorrowed and pld$EstimatedReturn
## t = -23.885, df = 19795, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1808775 -0.1537977
## sample estimates:
## cor
## -0.1673692
Compared with the variables examined so far, EstimatedReturn and ProsperPrincipalBorrowed are more strongly related, with a correlation of -0.167.
# scatterplot of ProsperPrincipalOutstanding and EstimatedReturn
ggplot(aes(x = log(ProsperPrincipalOutstanding+1), y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalOutstanding))) +
geom_point(alpha = 1/20, position = "jitter") +
stat_quantile(quantiles = c(0.25, 0.5, 0.75), aes(color = ..quantile..), lwd = 1)
# correlation between ProsperPrincipalOutstanding and EstimatedReturn
cor.test(pld$ProsperPrincipalOutstanding, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$ProsperPrincipalOutstanding and pld$EstimatedReturn
## t = -6.8638, df = 19795, p-value = 6.905e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06261448 -0.03482048
## sample estimates:
## cor
## -0.04872691
However, the relationship between EstimatedReturn and ProsperPrincipalOutstanding is weak (-0.049), and most of the points lie on the line x = 0.
# scatterplot of LoanOriginalAmount and EstimatedReturn
ggplot(aes(x = log(LoanOriginalAmount+1), y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(LoanOriginalAmount))) +
geom_point(alpha = 1/20, position = "jitter") +
stat_quantile(quantiles = c(0.25, 0.5, 0.75), aes(color = ..quantile..), lwd = 1)
# correlation between LoanOriginalAmount and EstimatedReturn
cor.test(pld$LoanOriginalAmount, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$LoanOriginalAmount and pld$EstimatedReturn
## t = -86.98, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2922833 -0.2799279
## sample estimates:
## cor
## -0.2861175
As with ProsperPrincipalBorrowed, the relationship between EstimatedReturn and LoanOriginalAmount is negative and relatively strong (-0.286).
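The three plots above put the amount variables on a log(x + 1) scale, while cor.test was run on the raw values. One way to make the number comparable with the picture is to correlate the transformed values instead. The helper name cor_log1p and the toy vectors are assumptions for illustration only.

```r
# Hypothetical helper: Pearson correlation after a log(x + 1) transform of x
cor_log1p <- function(x, y) {
  # log1p(x) computes log(x + 1); incomplete pairs are dropped
  cor(log1p(x), y, use = "complete.obs", method = "pearson")
}

# Toy vectors for illustration only (not taken from the Prosper data)
amount <- c(0, 9, 99, 999, NA)
ret    <- c(0.12, 0.10, 0.08, 0.06, 0.05)

raw_cor <- cor(amount, ret, use = "complete.obs")
log_cor <- cor_log1p(amount, ret)
print(c(raw = raw_cor, log1p = log_cor))
```

In the toy data the relationship is linear only on the log scale, so the transformed correlation is stronger than the raw one. On the real data the call would be cor_log1p(pld$LoanOriginalAmount, pld$EstimatedReturn); whether the same holds there has to be checked.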
# scatterplot of MonthlyLoanPayment and EstimatedReturn
ggplot(aes(x = MonthlyLoanPayment, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(MonthlyLoanPayment))) +
geom_point(alpha = 1/20, position = "jitter") +
stat_quantile(quantiles = c(0.25, 0.5, 0.75), aes(color = ..quantile..), lwd = 2) +
scale_x_continuous(limits = c(0, 1500), breaks = seq(0, 1500, 250))
# correlation between MonthlyLoanPayment and EstimatedReturn
cor.test(pld$MonthlyLoanPayment, pld$EstimatedReturn, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: pld$MonthlyLoanPayment and pld$EstimatedReturn
## t = -76.089, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2590193 -0.2464218
## sample estimates:
## cor
## -0.2527313
The relationship between EstimatedReturn and MonthlyLoanPayment is approximately the same as that between EstimatedReturn and LoanOriginalAmount: negative and relatively strong (-0.253).
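One plausible reason the two correlations look alike is that, for a fixed-rate amortizing loan with a given rate and term, the monthly payment is proportional to the amount borrowed. The sketch below uses the standard annuity payment formula; the function name and the example rate and term are assumptions chosen for illustration, not values from the dataset.

```r
# Standard fixed-rate amortization (annuity) payment formula
monthly_payment <- function(principal, annual_rate, n_months) {
  i <- annual_rate / 12   # monthly interest rate
  principal * i / (1 - (1 + i)^(-n_months))
}

# Hypothetical rate and term, chosen only for illustration
p_small <- monthly_payment(10000, 0.12, 36)
p_large <- monthly_payment(20000, 0.12, 36)
print(round(c(p_small, p_large), 2))
print(p_large / p_small)   # doubling the principal doubles the payment
```

If this holds for the loans in the data, MonthlyLoanPayment and LoanOriginalAmount carry much the same information, and cor(pld$MonthlyLoanPayment, pld$LoanOriginalAmount) would be a direct way to check.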
# line plot of TotalCreditLinespast7years and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = TotalCreditLinespast7years, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalCreditLinespast7years))) +
geom_line()+
geom_smooth()
Adding another variable, IsBorrowerHomeowner, we can see that in the relationship between EstimatedReturn and TotalCreditLinespast7years, up to some point homeowners have a slightly lower return than non-homeowners, but beyond that point the return for homeowners rises sharply, so there is a clear difference between the two groups.
# line plot of TotalInquiries and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = TotalInquiries, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalInquiries))) +
geom_line()+
geom_smooth() +
ylim(0, 0.2)
# line plot of DelinquenciesLast7Years and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = DelinquenciesLast7Years, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(DelinquenciesLast7Years))) +
geom_line() +
geom_smooth()
# line plot of TotalTrades and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = TotalTrades, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalTrades))) +
geom_line() +
geom_smooth()
The same trend seen in the plot of EstimatedReturn and TotalCreditLinespast7years appears here for TotalTrades, except that the gap at the end is bigger in this plot.
# line plot of DebtToIncomeRatio and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = DebtToIncomeRatio, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(DebtToIncomeRatio))) +
geom_line() +
geom_smooth()
# line plot of TotalProsperPaymentsBilled and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = TotalProsperPaymentsBilled, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalProsperPaymentsBilled))) +
geom_line() +
geom_smooth()
There is more noise in the red line (non-homeowners), which could mean that non-homeowners pay on time less consistently than homeowners.
# line plot of ProsperPrincipalBorrowed and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = ProsperPrincipalBorrowed, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalBorrowed))) +
geom_line() +
geom_smooth()
# line plot of ProsperPrincipalOutstanding and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = ProsperPrincipalOutstanding, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalOutstanding))) +
geom_line() +
geom_smooth()
# line plot of LoanOriginalAmount and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = LoanOriginalAmount, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(LoanOriginalAmount))) +
geom_line() +
geom_smooth()
# line plot of MonthlyLoanPayment and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = MonthlyLoanPayment, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(MonthlyLoanPayment))) +
geom_line() +
geom_smooth()
In general, whether the borrower is a homeowner does not have much influence on the relationship between EstimatedReturn and the other variables.
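To put a number on this visual impression, the correlation can be computed separately for homeowners and non-homeowners and the two values compared. The helper cor_by_group is hypothetical and the toy data frame is made up for illustration; only the column names are borrowed from the dataset.

```r
# Hypothetical helper: Pearson correlation of x and y within each level of a group
cor_by_group <- function(data, x, y, group) {
  pieces <- split(data, data[[group]])
  sapply(pieces, function(d) cor(d[[x]], d[[y]], use = "complete.obs"))
}

# Toy data frame for illustration only (not taken from the Prosper data)
toy <- data.frame(
  IsBorrowerHomeowner = c("False", "False", "False", "True", "True", "True"),
  LoanOriginalAmount  = c(1000, 2000, 3000, 1000, 2000, 3000),
  EstimatedReturn     = c(0.12, 0.10, 0.08, 0.11, 0.10, 0.09)
)
by_owner <- cor_by_group(toy, "LoanOriginalAmount", "EstimatedReturn",
                         "IsBorrowerHomeowner")
print(by_owner)
```

Applied to the real data, cor_by_group(pld, "LoanOriginalAmount", "EstimatedReturn", "IsBorrowerHomeowner") returns one correlation per group; similar values would support the statement above.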
# another two variables Occupation and IncomeRange added on TotalCreditLinespast7years and EstimatedReturn
ggplot(aes(x = TotalCreditLinespast7years, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(TotalCreditLinespast7years)))+
geom_line(stat = "summary", fun.y = median) +
geom_smooth() +
facet_grid(Occupation ~ IncomeRange)
# another two variables Occupation and IncomeRange added on TotalInquiries and EstimatedReturn
ggplot(aes(x = TotalInquiries, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(TotalInquiries))) +
geom_line(stat = "summary", fun.y = median)+
geom_smooth() +
ylim(0, 0.2) +
facet_grid(Occupation ~ IncomeRange)
# another two variables Occupation and IncomeRange added on DelinquenciesLast7Years and EstimatedReturn
ggplot(aes(x = DelinquenciesLast7Years, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(DelinquenciesLast7Years))) +
geom_line(stat = "summary", fun.y = median) +
geom_smooth() +
facet_grid(Occupation ~ IncomeRange)
# another two variables Occupation and IncomeRange added on TotalTrades and EstimatedReturn
ggplot(aes(x = TotalTrades, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(TotalTrades))) +
geom_line(stat = "summary", fun.y = median) +
geom_smooth() +
facet_grid(Occupation ~ IncomeRange)
# another two variables Occupation and IncomeRange added on DebtToIncomeRatio and EstimatedReturn
ggplot(aes(x = DebtToIncomeRatio, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(DebtToIncomeRatio))) +
geom_line(stat = "summary", fun.y = median) +
geom_smooth() +
facet_grid(Occupation ~ IncomeRange)
# another two variables Occupation and IncomeRange added on TotalProsperPaymentsBilled and EstimatedReturn
ggplot(aes(x = TotalProsperPaymentsBilled, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(TotalProsperPaymentsBilled))) +
geom_line(stat = "summary", fun.y = median) +
geom_smooth() +
facet_grid(Occupation ~ IncomeRange)
# another two variables Occupation and IncomeRange added on ProsperPrincipalBorrowed and EstimatedReturn
ggplot(aes(x = ProsperPrincipalBorrowed, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalBorrowed))) +
geom_line(stat = "summary", fun.y = median) +
geom_smooth() +
facet_grid(Occupation ~ IncomeRange)
# another two variables Occupation and IncomeRange added on ProsperPrincipalOutstanding and EstimatedReturn
ggplot(aes(x = ProsperPrincipalOutstanding, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalOutstanding))) +
geom_line(stat = "summary", fun.y = median) +
geom_smooth() +
facet_grid(Occupation ~ IncomeRange)
# another two variables Occupation and IncomeRange added on LoanOriginalAmount and EstimatedReturn
ggplot(aes(x = LoanOriginalAmount, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(LoanOriginalAmount))) +
geom_line(stat = "summary", fun.y = median) +
geom_smooth() +
facet_grid(Occupation ~ IncomeRange)
# another two variables Occupation and IncomeRange added on MonthlyLoanPayment and EstimatedReturn
ggplot(aes(x = MonthlyLoanPayment, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(MonthlyLoanPayment))) +
geom_line(stat = "summary", fun.y = median) +
geom_smooth() +
facet_grid(Occupation ~ IncomeRange)
We can see from these plots that the IncomeRange of $25,000-49,999 is the quietest range compared with the others, especially the range of $50,000-74,999; on the other hand, the occupations Administrative Assistant and Executive are the noisiest.
Let’s look at some plots again.
ggplot(aes(x = IncomeRange, y = EstimatedReturn), data = pld.IncomeRange) +
geom_boxplot() +
coord_cartesian(ylim = c(0.05, 0.15))
The IncomeRange of $1-24,999 has the highest median return, but it has far fewer samples than the range of $25,000-49,999, which has the most samples in the dataset. Ignoring the $1-24,999 range, the range of $25,000-49,999 has the highest median return.
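Since the argument depends on both the median and the number of samples in each range, it can help to tabulate them side by side, with the interquartile range as a rough measure of how stable the return is. The helper group_summary is a hypothetical base R sketch, and the toy vectors are made up for illustration.

```r
# Hypothetical helper: sample size, median and IQR of y for each level of a group
group_summary <- function(y, group) {
  ok <- !is.na(y) & !is.na(group)   # drop incomplete pairs
  y <- y[ok]
  group <- droplevels(factor(group[ok]))
  data.frame(
    group  = levels(group),
    n      = as.vector(tapply(y, group, length)),
    median = as.vector(tapply(y, group, median)),
    iqr    = as.vector(tapply(y, group, IQR))
  )
}

# Toy vectors for illustration only (not taken from the Prosper data)
toy_income <- c("$1-24,999", "$1-24,999", "$25,000-49,999", "$25,000-49,999",
                "$25,000-49,999", "$25,000-49,999", "$25,000-49,999")
toy_return <- c(0.10, 0.14, 0.08, 0.09, 0.10, 0.11, NA)
smry <- group_summary(toy_return, toy_income)
print(smry)
```

On the real data, group_summary(pld$EstimatedReturn, pld$IncomeRange) gives one row per income range, so the reader can judge how much weight each median deserves.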
ggplot(aes(x = Occupation, y = EstimatedReturn), data = subset(pld.Occupation, !is.na(EstimatedReturn))) +
geom_boxplot() +
coord_cartesian(ylim = c(0.05, 0.15))
Administrative Assistant has the highest median return.
p1 <- ggplot(aes(x = LoanOriginalAmount, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(LoanOriginalAmount))) +
geom_line() +
geom_smooth()
p2 <- ggplot(aes(x = MonthlyLoanPayment, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(MonthlyLoanPayment))) +
geom_line() +
geom_smooth()
grid.arrange(p1, p2, ncol = 2)
There is no big difference between homeowners and non-homeowners, but there is a relatively strong negative relationship between EstimatedReturn and LoanOriginalAmount, and between EstimatedReturn and MonthlyLoanPayment.
Taking all five plots together, we can see that, of these two particular groups, the IncomeRange of $25,000-49,999 has the most stable return, whereas the Administrative occupation has the most unstable return; at the same time, it matters whether the borrower is a homeowner or not.